Preface

In this problem set we will exercise some of the unsupervised learning approaches on 2016 Global Health Observatory data. It is available at that website in the form of an Excel file, but a cleaned-up version ready for import into R for further analyses is available at the CSCI E-63C canvas course web site as whs2016_AnnexB-data-wo-NAs.txt. The cleaning and reformatting included: merging data from the two parts of Annex B, reducing column headers to one line with short tags, removal of “>”, “<” and whitespace, conversion to numeric format, and replacement of undefined values (indicated by en-dashes in the Excel file) with the corresponding averages of those attributes. The code that was used to format the merged data is shown at the end of this document for your reference only. You are advised to save yourself that trouble and start from the preformatted text file available at the course website as shown above. The explicit mapping of short variable names to their full descriptions as provided in the original file is available in the Excel file whs2016_AnnexB-reformatted.xls, also on the course canvas page. Lastly, you are advised to download a local copy of the text file to your computer and access it there (as opposed to relying on R’s ability to establish a URL connection to canvas, which potentially requires login etc.)

The short code example below illustrates reading this data from a local copy on your computer (assuming it has been copied into the current working directory of your R session – the getwd() and setwd() commands are helpful for finding out what it currently is and changing it to the desired location) and displaying summaries and a pairs plot of five (out of almost 40) arbitrarily chosen variables. This is done for illustration purposes only – the problems in this set expect use of all variables in this dataset.

whsAnnBdatNum <- read.table("whs2016_AnnexB-data-wo-NAs.txt",
                            sep="\t",header=TRUE,quote="",encoding="ISO-8859-1")
summary(whsAnnBdatNum[,c(1,4,7,10,17)])
##      TOTPOP          LIFEXPB.B       MEDBIRTHS          NEWHIV      
##  Min.   :      2   Min.   :50.10   Min.   :  9.00   Min.   : 0.100  
##  1st Qu.:   1876   1st Qu.:66.03   1st Qu.: 78.50   1st Qu.: 0.300  
##  Median :   8070   Median :72.50   Median : 96.00   Median : 1.563  
##  Mean   :  37696   Mean   :71.29   Mean   : 84.28   Mean   : 1.563  
##  3rd Qu.:  26413   3rd Qu.:76.38   3rd Qu.: 99.00   3rd Qu.: 1.563  
##  Max.   :1383925   Max.   :83.70   Max.   :100.00   Max.   :20.100  
##      ALCONS      
##  Min.   : 0.000  
##  1st Qu.: 2.900  
##  Median : 6.085  
##  Mean   : 6.085  
##  3rd Qu.: 8.700  
##  Max.   :17.400
pairs(whsAnnBdatNum[,c(1,4,7,10,17)])

This dataset is somewhat similar to the USArrests dataset used extensively in ISLR labs and exercises – it collects various continuous statistics characterizing human populations across different territories. It is severalfold larger though – instead of the 50 US states and 4 attributes in USArrests, the world health statistics (WHS) data characterizes 194 WHO member states by 37 variables. Have fun!

The following problems are largely modeled after the labs and exercises from ISLR Chapter 10. If anything presents a challenge, besides asking questions on piazza (always a good idea!), you are also encouraged to review the corresponding lab sections in ISLR Chapter 10.

Problem 1: Principal components analysis (PCA) (25 points)

The goal here is to appreciate the impact of scaling of the input variables on the result of the principal components analysis. To that end, you will first survey means and variances of the attributes in this dataset (sub-problem 1a) and then obtain and explore results of PCA performed on data as is and after centering and scaling each attribute to zero mean and standard deviation of one (sub-problem 1b).

Sub-problem 1a: means and variances of WHS attributes (5 points)

Compare the means and variances of the untransformed attributes in the world health statistics dataset – a plot of variance vs. mean is probably the best approach given the number of attributes in the dataset. The function apply allows you to apply a desired function (e.g. mean, var or sd) to each row or column of a table. Do you see all 37 attributes in the plot, or at least most of them? (Remember that you can use plot(inpX,inpY,log="xy") to use log scale on both the horizontal and vertical axes.) Is there a dependency between the attributes’ averages and variances? What is the range of means and variances when calculated on untransformed data? Which are the top two attributes with the highest mean or variance? What are the implications for a PCA rendition of this dataset (in two dimensions) if applied to untransformed data?

apply(whsAnnBdatNum, 2, var)
##       TOTPOP    LIFEXPB.M    LIFEXPB.F    LIFEXPB.B      HLIFEXP 
## 1.995191e+10 5.615186e+01 6.599687e+01 6.005323e+01 4.922454e+01 
##      MATMORT    MEDBIRTHS    MORTLT5YO     NEONMORT       NEWHIV 
## 5.071617e+04 4.682497e+02 1.063194e+03 1.274194e+02 6.646701e+00 
##        TBINC      MALARIA     HEPBVACC     INTINTDS   PRMORT3070 
## 2.491708e+04 8.987006e+03 1.725019e+02 1.958163e+15 3.033572e+01 
##      SUICIDE       ALCONS     MORTRAFF   MODFAMPLAN    ADOBIRTHS 
## 4.397049e+01 1.564679e+01 9.855015e+01 3.077933e+02 2.209936e+03 
##      MORTAIR     MORTWASH     MORTPOIS       SMOK.M       SMOK.F 
## 1.953398e+03 4.593396e+02 4.359931e+00 1.169154e+02 7.418528e+01 
##       DRDENS  AVEI13HRCCS     STUNT5YO      WAST5YO        OW5YO 
## 2.571266e+03 4.325628e+02 1.211796e+02 1.556119e+01 1.513093e+01 
##    BETTERH2O    BETTERSAN    CLEANFUEL     AVEPM2.5      DEATHND 
## 1.913805e+02 7.996170e+02 1.301432e+03 4.093812e+02 4.330536e-01 
##     HOMICIDE    DEATHCONF 
## 1.384443e+02 5.451967e+02
plot(apply(whsAnnBdatNum, 2, mean), apply(whsAnnBdatNum, 2, var), 
     log = "xy", xlab = "mean", ylab = "variance")

The means and variances do show a positive correlation, which appears roughly linear in log-log space. For the most part, the means fall somewhere between 1 and 100, while the variances fall roughly between 10 and 10,000. There are two major outliers, which are also the two features with the highest means and variances: INTINTDS (total number of people requiring interventions against NTDs) and TOTPOP (total population). I’m a little baffled that the mean number of people requiring this sort of intervention is higher than the mean of the total population, but that’s what the data says.

Because these two variables have variances that completely dwarf those of the rest of the data, a PCA reduction applied to the untransformed data would mainly capture the variation in these two features and likely miss more interesting correlations in the rest of the data.

Sub-problem 1b: PCA on untransformed and scaled WHS data (20 points)

Perform the steps outlined below both using the untransformed data and using scaled attributes of the WHS dataset (remember, you can use the R function prcomp to run PCA, and to scale the data you can either pass the output of scale applied to the WHS data matrix as input to prcomp, or call prcomp with the parameter scale. set to TRUE). To make it explicit, the comparisons outlined below have to be performed first on the untransformed WHS data and then again on the scaled WHS data – you should obtain two sets of results that you can compare and contrast.

  1. Obtain results of principal components analysis of the data (by using prcomp)
  2. Generate a scree plot of the PCA results (by calling plot on the result of prcomp)
  3. Generate a plot of the first two principal components using biplot. Which variables seem to predominantly drive the results of PCA when applied to untransformed data?
  • Please note that in the case of untransformed data you should expect biplot to generate a substantial number of warnings. Usually in R we should pay attention to these and understand whether they indicate that something went wrong in our analyses. In this particular case they are expected – why do you think that is?
  4. The field rotation in the output of prcomp contains the loadings of the 1st, 2nd, etc. principal components (PCs) – these can be interpreted as the contributions of each of the attributes in the input data to each of the PCs.
  • Which attributes have the largest (by absolute value) loadings for the first and second principal components?
  • How does this compare to what you observed when comparing the means and variances of all attributes in the world health statistics dataset?
  5. Calculate the percentage of variance explained (PVE) by the first five principal components (PCs). You can find an example of doing this in ISLR Chapter 10.4 (Lab 1 on PCA).
usPCA <- prcomp(whsAnnBdatNum)
plot(usPCA)

biplot(usPCA, pc.biplot = TRUE)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## (the same warning repeats for each of the remaining near-zero loading arrows)

It appears that the variance explained by the first principal component is many orders of magnitude higher than that of any of the others. Looking at the biplot, this is because the variance of INTINTDS was so much higher than anything else, so the first PC is essentially dominated by INTINTDS plus (judging by the slight spread of the red feature labels around the zero line) whatever correlation the other features have with it.

usPCA$rotation[, c("PC1", "PC2")]
##                       PC1           PC2
## TOTPOP       2.216726e-03  9.999974e-01
## LIFEXPB.M   -2.241646e-08  1.232984e-05
## LIFEXPB.F   -2.866965e-08  1.270936e-05
## LIFEXPB.B   -2.569642e-08  1.254383e-05
## HLIFEXP     -2.400391e-08  1.193143e-05
## MATMORT      6.035795e-07 -3.598408e-04
## MEDBIRTHS   -8.818667e-08  3.214788e-05
## MORTLT5YO    1.144968e-07 -4.865985e-05
## NEONMORT     5.286241e-08 -1.619241e-05
## NEWHIV      -5.326614e-10 -4.688615e-07
## TBINC        4.634768e-07 -1.389049e-04
## MALARIA      6.078734e-08 -1.665374e-04
## HEPBVACC    -5.116404e-08  1.372377e-05
## INTINTDS     9.999975e-01 -2.216726e-03
## PRMORT3070   1.705779e-08 -2.597847e-06
## SUICIDE      1.208446e-08  1.617659e-06
## ALCONS      -5.139552e-09  2.392531e-06
## MORTRAFF     1.262362e-08 -3.637722e-06
## MODFAMPLAN  -2.125818e-09  2.349719e-05
## ADOBIRTHS    5.677051e-08 -7.290218e-05
## MORTAIR      1.691722e-07  3.096639e-05
## MORTWASH     7.199476e-08 -3.157988e-05
## MORTPOIS     5.358021e-09 -4.700044e-07
## SMOK.M      -1.739610e-08  7.817615e-06
## SMOK.F      -3.289046e-08 -4.406133e-06
## DRDENS      -1.433164e-07  1.411690e-05
## AVEI13HRCCS  3.228241e-08  3.846087e-05
## STUNT5YO     5.439533e-08 -2.235368e-05
## WAST5YO      2.141885e-08 -5.431101e-06
## OW5YO       -1.363438e-08  6.125457e-07
## BETTERH2O   -3.050988e-08  1.930157e-05
## BETTERSAN   -1.282282e-07  3.449176e-05
## CLEANFUEL   -1.493207e-07  3.806764e-05
## AVEPM2.5     1.022098e-07  1.705293e-05
## DEATHND      4.115525e-10 -6.978771e-08
## HOMICIDE    -6.359166e-09 -4.154007e-06
## DEATHCONF   -9.921508e-09 -1.250904e-06

As I predicted earlier, INTINTDS contributes the vast majority of PC1, and TOTPOP (total population) contributes most of PC2.

usPVE <- usPCA$sdev^2 / sum(usPCA$sdev^2)
sum(usPVE[1:5])
## [1] 1

The calculation shows that essentially all of the variance is explained by the first five PCs – a somewhat bogus result driven by the dominant unscaled attributes. Let’s see how the results differ with scaling.

sPCA <- prcomp(whsAnnBdatNum, scale. = TRUE)
plot(sPCA)

biplot(sPCA, pc.biplot = TRUE)

selectCountries <- c(
  "China",
  "India",
  "Switzerland",
  "Zimbabwe",
  "Australia",
  "Japan",
  "Peru",
  "Honduras"
)
plot(sPCA$x[,1:2])
text(sPCA$x[selectCountries,1:2], selectCountries)

sPCA$rotation[, c("PC1", "PC2")]
##                      PC1          PC2
## TOTPOP       0.004884629  0.006151152
## LIFEXPB.M   -0.243610019 -0.045716073
## LIFEXPB.F   -0.253078454 -0.038740331
## LIFEXPB.B   -0.250555808 -0.044163504
## HLIFEXP     -0.251577110 -0.062282515
## MATMORT      0.232562940 -0.060483184
## MEDBIRTHS   -0.208518840  0.117008115
## MORTLT5YO    0.248011043 -0.017169766
## NEONMORT     0.240496335  0.019160321
## NEWHIV       0.057243300  0.175346962
## TBINC        0.161651357  0.132898176
## MALARIA      0.139047669 -0.223200459
## HEPBVACC    -0.134296639  0.110438222
## INTINTDS     0.050226781 -0.045675955
## PRMORT3070   0.118627350 -0.021286435
## SUICIDE     -0.061782041 -0.338173644
## ALCONS      -0.096091735 -0.231083867
## MORTRAFF     0.161517262  0.258381664
## MODFAMPLAN  -0.110861607  0.290400834
## ADOBIRTHS    0.210892027  0.086389966
## MORTAIR      0.106229940 -0.188556510
## MORTWASH     0.222913074 -0.120690341
## MORTPOIS     0.161571637 -0.210227555
## SMOK.M       0.008364455 -0.071481443
## SMOK.F      -0.093569876 -0.357586945
## DRDENS      -0.159402565 -0.232060185
## AVEI13HRCCS -0.153098290  0.076947082
## STUNT5YO     0.171187159 -0.146795674
## WAST5YO      0.117042181 -0.202157355
## OW5YO       -0.090929875  0.071399248
## BETTERH2O   -0.212469885  0.009767579
## BETTERSAN   -0.230942609 -0.003392105
## CLEANFUEL   -0.224286228  0.038256931
## AVEPM2.5     0.086627851  0.123334535
## DEATHND      0.009362133 -0.075613737
## HOMICIDE     0.040297314  0.391234680
## DEATHCONF    0.017241234  0.055725129
sPVE <- sPCA$sdev^2 / sum(sPCA$sdev^2)
sum(sPVE[1:5])
## [1] 0.6304403

With the scaled data, the results appear much better. Most of the features make meaningful contributions to the first two principal components. Quite a few loadings are negative, which I interpret as negative correlation between those features and the ones with positive loadings. In other words, as a feature such as CLEANFUEL goes down, features such as STUNT5YO tend to go up, and because the CLEANFUEL loading is negative, lower CLEANFUEL values will increase the PC1 score of that particular sample.

In addition, the spread of countries shown on the plot of the first two principal components is much more sensible, with less clustering around the x=0 and y=0 lines. This is also apparent in the biplot, which has far larger and more diffuse red arrows for most of the features. Finally, the variance explained by the first five principal components is reduced to 0.63.
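A quick way to visualize how the variance accumulates across the scaled PCs is to plot the per-component and cumulative PVE – a minimal sketch reusing the sPVE vector computed above:

# scree-style plots of per-component and cumulative PVE for the scaled PCA
plot(sPVE, type = "b", xlab = "principal component",
     ylab = "proportion of variance explained")
plot(cumsum(sPVE), type = "b", xlab = "principal component",
     ylab = "cumulative PVE")
abline(h = 0.9, lty = 2)  # e.g. mark how many PCs cover 90% of the variance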

Now that you have PCA results when applied to untransformed and scaled WHS data, please comment on how they compare and what the effect of scaling is. What dataset attributes contribute the most (by absolute value) to the top two principal components in each case (untransformed and scaled data)? What are the signs of those contributions? How do you interpret that?

Please note that the output of biplot with almost 200 text labels on it can be pretty busy and tough to read. You can achieve better control when plotting PCA results if you instead plot the first two columns of the x attribute in the output of prcomp – e.g. plot(prcomp(USArrests,scale=T)$x[,1:2]). Given this plot, you can then label a subset of countries by using the text function in R to add labels at specified positions. Please feel free to choose several countries of your preference and discuss the results. Alternatively, indicate US, UK, China, India, Mexico, Australia, Israel, Italy, Ireland and Sweden and discuss the results. Where do the countries you have plotted fall in the graph? Considering what you found out about the contributions of different attributes to the first two PCs, what do their positions tell us about their (dis-)similarities in terms of the associated health statistics?

Problem 2: K-means clustering (20 points)

The goal of this problem is to practice use of K-means clustering and in the process appreciate the variability of the results due to different random starting assignments of observations to clusters and the effect of parameter nstart in alleviating it.

Sub-problem 2a: k-means clusters of different size (5 points)

Using function kmeans perform K-means clustering on explicitly scaled (e.g. kmeans(scale(x),2)) WHS data for 2, 3 and 4 clusters. Use cluster attribute in the output of kmeans to indicate cluster membership by color and/or shape of the corresponding symbols in the plot of the first two principal components generated independently on the same (scaled WHS) data. E.g. plot(prcomp(xyz)$x[,1:2],col=kmeans(xyz,4)$cluster) where xyz is input data. Describe the results. Which countries are clustered together for each of these choices of \(K\)?

scWHS <- scale(whsAnnBdatNum)
scPCA <- prcomp(scWHS)
selectCountries <- c(
  "China",
  "India",
  "Switzerland",
  "Zimbabwe",
  "Australia",
  "Japan",
  "Peru",
  "Honduras"
)

for (cl in 2:4) {
  kmWHS <- kmeans(scWHS, cl)
  plot(scPCA$x[,1:2], col=kmWHS$cluster)
  text(scPCA$x[selectCountries,1:2], selectCountries)
  title(paste(cl, "clusters"))
}

With the countries I selected, rich first-world countries (Australia, Japan, Switzerland) are always grouped together. Interestingly, with three clusters many of the countries towards the left side of the plot get grouped together and there are two smaller groups on the right side, one of which contains India and Zimbabwe. These two countries are also grouped separately from the rest in the 4-cluster result, but there the in-between countries get their own cluster away from the rich countries.

Sub-problem 2b: variability of k-means clustering and effect of nstart parameter (15 points)

By default, k-means clustering uses a random set of centers as initial guesses for the cluster centers. Here we will explore the variability of k-means cluster membership across several such initial random guesses. To make such choices of random centers reproducible, we will use the function set.seed to reset the random number generator (RNG) used in R to a known/controlled initial state.

Using the approach defined above, repeat k-means clustering of the explicitly scaled WHS data with four clusters (centers=4) three times, resetting the RNG each time with set.seed using seeds of 1, 2 and 3 respectively (and the default value of nstart=1). Indicate cluster membership in each of these three trials on the plot of the first two principal components using color and/or shape as described above. Two fields in the output of kmeans – tot.withinss and betweenss – characterize the within- and between-cluster sum-of-squares. Tighter clustering results are those with a smaller ratio of within to between sum-of-squares. What are the resulting ratios of within to between sum-of-squares for each of these three k-means clustering results (with random seeds of 1, 2 and 3)?

Please bear in mind that the actual cluster identity is assigned randomly and does not matter – i.e. if cluster 1 from the first run of kmeans (with random seed of 1) and cluster 4 from the run with the random seed of 2 contain the same observations (country/states in case of WHS dataset), they are the same clusters.
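One way to verify that two runs found the same partition despite permuted labels is to cross-tabulate their cluster assignments with table – a minimal sketch, assuming the scaled data matrix scWHS defined earlier in this problem:

set.seed(1)
km1 <- kmeans(scWHS, centers = 4)
set.seed(2)
km2 <- kmeans(scWHS, centers = 4)
# if every row and column of the table has exactly one non-zero cell,
# the two partitions are identical up to relabeling of the clusters
table(seed1 = km1$cluster, seed2 = km2$cluster)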

Repeat the same procedure (k-means with four clusters for RNG seeds of 1, 2 and 3) now using nstart=100 as a parameter in the call to kmeans. Represent results graphically as before. How does cluster membership compare between those three runs now? What is the ratio of within to between sum-of-squares in each of these three cases? What is the impact of using higher than 1 (default) value of nstart? What is the ISLR recommendation on this offered in Ch. 10.5.1?

One way to achieve everything this sub-problem calls for is to loop over nstart values of 1 and 100, for each value of nstart, loop over RNG seeds of 1, 2 and 3, for each value of RNG seed, reset RNG, call kmeans and plot results for each combination of nstart and RNG seed value.

for (ns in c(1, 100)) {
  for (seed in 1:3) {
    set.seed(seed)
    kmWHS <- kmeans(scWHS, centers=4, nstart=ns)
    plot(scPCA$x[,1:2], col=kmWHS$cluster)
    text(scPCA$x[selectCountries,1:2], selectCountries)
    title(paste(ns, "random sets with seed", seed))
    # within/total SS ratio; ranks results the same way as within/between,
    # since betweenss = totss - tot.withinss
    print(paste(ns, "random sets with seed", seed, ":", kmWHS$tot.withinss/kmWHS$totss))
  }
}

## [1] "1 random sets with seed 1 : 0.583001243787346"

## [1] "1 random sets with seed 2 : 0.595791576047246"

## [1] "1 random sets with seed 3 : 0.613151533293285"

## [1] "100 random sets with seed 1 : 0.583001243787346"

## [1] "100 random sets with seed 2 : 0.583001243787346"

## [1] "100 random sets with seed 3 : 0.583001243787346"

With nstart=1, the clusters do differ from run to run. This difference disappears when nstart is set to 100 – in fact the within- to total sum-of-squares ratio is exactly the same across the three runs. The textbook recommends always using a large value of nstart (such as 20 or 50) to ensure the algorithm does not end up in a poor local optimum from a single random start.

Problem 3: Hierarchical clustering (15 points)

Sub-problem 3a: hierarchical clustering by different linkages (10 points)

Cluster the countries in the (scaled) world health statistics data using the default (Euclidean) distance and “complete”, “average”, “single” and “ward” linkages in the call to hclust. Plot each clustering hierarchy and describe the differences. For comparison, plot the results of clustering the untransformed WHS data using default parameters (Euclidean distance, “complete” linkage) – discuss the impact of scaling on the outcome of hierarchical clustering.

scWHS <- scale(whsAnnBdatNum)
scPCA <- prcomp(scWHS)
selectCountries <- c(
  "China",
  "India",
  "Switzerland",
  "Zimbabwe",
  "Australia",
  "Japan",
  "Peru",
  "Honduras"
)
methods <- c(
  "complete", 
  "average", 
  "single",
  "ward.D"
)
d <- dist(scWHS)  # the distance matrix does not change across linkages
for (method in methods) {
  scHC <- hclust(d, method = method)
  plot(scHC, main = method, cex = 0.5)
}

It’s a bit difficult to tell the dendrograms apart at this size; I unfortunately don’t have the time to wrestle with plotting sizes to make them more interpretable.
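The sub-problem also asks for a comparison against clustering of the untransformed data; a minimal sketch with default parameters (Euclidean distance, “complete” linkage) follows. By analogy with unscaled PCA, the expectation is that the high-variance attributes such as TOTPOP and INTINTDS dominate the distances:

# hierarchical clustering of the untransformed WHS data for comparison
usHC <- hclust(dist(whsAnnBdatNum))
plot(usHC, main = "untransformed, complete linkage", cex = 0.5)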

Sub-problem 3b: compare k-means and hierarchical clustering (5 points)

Using function cutree on the output of hclust determine assignment of the countries in WHS dataset into top four clusters when using Euclidean distance and “complete” linkage. Use function table to compare membership of these clusters to those produced by k-means clustering with four clusters in the Problem 2(b) when using nstart=100 (and any of the RNG seeds) above. Discuss the results.
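A sketch of what this comparison could look like, assuming the scaled matrix scWHS from Problem 2 and an RNG seed of 1 (any of the seeds should give the same nstart=100 result):

# top four clusters from hierarchical clustering (Euclidean, complete linkage)
hcClusters <- cutree(hclust(dist(scWHS), method = "complete"), k = 4)
# four k-means clusters with nstart=100 as in Problem 2b
set.seed(1)
kmClusters <- kmeans(scWHS, centers = 4, nstart = 100)$cluster
# cluster labels are arbitrary, so look for rows/columns of the
# cross-tabulation that are dominated by a single cell
table(hclust = hcClusters, kmeans = kmClusters)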

Appendix: pre-processing of WHS data

For your reference only – the file it generates is already available at our course website.

# read the raw merged Annex B data
whsAnnBdat <- read.table("../data/whs2016_AnnexB-data.txt",sep="\t",header=T,as.is=T,quote="")
dim(whsAnnBdat)
# remove ">", "<" and whitespace from all values
whsAnnBdat <- apply(whsAnnBdat,2,function(x)gsub(">","",gsub("<","",gsub(" ","",x))))
# blank out undefined values marked with an en-dash (character code 150)
whsAnnBdat <- apply(whsAnnBdat,2,function(x){x[x==rawToChar(as.raw(150))]<-"";x})
# use country names (first column) as row names
rownames(whsAnnBdat) <- whsAnnBdat[,1]
whsAnnBdat <- whsAnnBdat[,-1]
# convert to numeric and replace NAs with the corresponding attribute averages
whsAnnBdatNum <- apply(whsAnnBdat,2,as.numeric)
whsAnnBdatNum <- apply(whsAnnBdatNum,2,function(x){x[is.na(x)] <- mean(x,na.rm = TRUE);x})
rownames(whsAnnBdatNum) <- rownames(whsAnnBdat)
write.table(whsAnnBdatNum,"../data/whs2016_AnnexB-data-wo-NAs.txt",quote=F,sep="\t")